HTTX

HTML to TEXT converter
Created by Gabriele Favrin
(E-Mail: favrin@tin.it - FidoNet: 2:333/726.8)

Version 1.7 - January 1998




*** CHILDWARE ***

This software is "CHILDWARE". The author explicitly asks whoever uses this program to make a donation, in any form or way, to a charitable organization that works to help children.

If you don't know of any, ask at your local post office how to make a payment to UNICEF.

The amount of the donation is up to you, but please do it!




Utilization terms

Before running this program on your computer, please read the following paragraph carefully, and continue only if you agree with the terms written below.

THE AUTHOR OF HTTX IS IN NO WAY RESPONSIBLE FOR MORAL AND/OR MATERIAL DAMAGE THAT HIS PROGRAM MAY CAUSE TO PEOPLE OR THINGS. THE PROGRAMMER HAS DONE HIS BEST TO LIMIT THE PROBLEMS THAT HTTX MAY CAUSE, BUT HE CANNOT GUARANTEE THAT IT WILL WORK CORRECTLY IN EVERY SITUATION. BY USING HTTX, YOU, THE USER, ACCEPT ALL MORAL, MATERIAL, CIVIL AND PENAL RESPONSIBILITY.

WARNING:

Many HTML documents are under Copyright and are not freely distributable, even when converted to plain text format. The author declines all responsibility for the use of files generated with HTTX.

All of the programs mentioned in this document are the property of their respective owners.



Properties of HTTX and distribution terms

The executable program, the source code and the ideas that are its basis are the EXCLUSIVE PROPERTY of Gabriele Favrin. All rights reserved.

HTTX is freeware, NOT public domain. It may be distributed only if the executable files and the documentation remain unchanged. Distributing the files in archive formats other than LhA is permitted, but compressing the individual files using PowerPacker or similar tools is not.

The inclusion of HTTX or its parts on the cover disks of magazines is permitted only with authorization from the author.

Commercial use of this package is granted exclusively to AmiTrix and Yvon Rozijn (AWeb).

Aminet, Fred Fish, Meeting Pearls, Amy Resource and CU Amiga Magazine staffs are authorized to include HTTX in their public domain software collections.



Introduction

HTTX (HTml > TXt) is a program to convert files from HTML format, used for documents on the World Wide Web, to plain ASCII. There are similar products, but since none completely satisfied my needs, I started to write one myself.

I don't claim it is the best or the fastest one, but it does have some features not found so far in similar Amiga programs.



Hardware requirements and installation

Required system:
Amiga, 512K, Kickstart 2.04 (37.175) or above.

Required memory:
The size of the file to convert, plus about 15K for buffers and other data.

Install:
Copy HTTX to your C: directory or a directory in your current path.



How to use

HTTX MUST be launched from the Shell.

Command syntax:

HTTX InputFile [OutputFile] [options]

The parameters in square brackets are optional. You are only required to specify a valid HTML file ("InputFile").

If no OutputFile is specified, it defaults to 'InputFile'.txt (e.g. "test.html" will be saved as "test.html.txt"). If OutputFile is a directory path, the file is saved in that directory under its default name.

Examples:

HTTX data:txt/html/aboxe.html

The file "aboxe.html" will be converted and saved as "data:txt/html/aboxe.html.txt"

HTTX data:txt/html/aboxe.html ram:aboxe.txt

The file "aboxe.html" will be converted and saved as "ram:aboxe.txt"

HTTX data:txt/html/aboxe.html data:txt/

The file "aboxe.html" will be converted and saved as "data:txt/aboxe.html.txt"
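
The naming rule shown by the three examples above can be sketched in a few lines of Python. This is only an illustration, not the code used by HTTX: the function name is hypothetical, and detecting a directory by a trailing "/" or ":" is an assumption (the program itself may well ask the filesystem).

```python
def default_output_name(input_file, output_file=None):
    """Return the output path suggested by the three examples above."""
    # Bare file name: drop the AmigaDOS volume ("data:") and directories.
    base = input_file.replace(":", "/").split("/")[-1]
    if output_file is None:
        # No OutputFile given: ".txt" is appended to the input path.
        return input_file + ".txt"
    if output_file.endswith(("/", ":")):
        # ASSUMPTION: a trailing "/" or ":" marks a directory or volume.
        return output_file + base + ".txt"
    # A full file name was given: use it as is.
    return output_file


print(default_output_name("data:txt/html/aboxe.html", "data:txt/"))
```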



Command line parameters

HTTX offers many options to control the conversion process.

LEN
Maximum line length of the output file.

Default: 77 - Minimum: 15 - Maximum: 255

INDENT or IN
Number of spaces used to indent (shift to the right) the <UL> and <DL> lists. The specified value must allow at least two levels of indentation within the line length specified with the LEN option.

Default: 3 - Minimum: 1 - Maximum: (LEN value - 10) / 3
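
As a worked example of the maximum given above, the following Python sketch computes the largest INDENT value for a given LEN. The function name is hypothetical, and rounding down with integer division is an assumption, since this manual does not say how the result of the division is rounded.

```python
def max_indent(line_length=77):
    """Largest INDENT allowed for a given LEN, from (LEN value - 10) / 3."""
    if not 15 <= line_length <= 255:
        raise ValueError("LEN must be between 15 and 255")
    # ASSUMPTION: the fractional part is dropped (integer division).
    return (line_length - 10) // 3


for length in (15, 77, 255):
    print(length, max_indent(length))
```

With the default LEN of 77 this gives (77 - 10) / 3 = 22 after rounding down, so the default INDENT of 3 is well inside the limit.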

ANSI or AN
ANSI conversion of HTML styles and LINKS (HREF and NAME) and optimization of alignment functions.

Do not use it if the converted text will be posted in message areas, like Fidonet or Usenet newsgroups.

Please read the "Notes about ANSI conversion" section for important information about ANSI sequences and general compatibility issues.

Default: OFF (styles are not converted).

7BIT
Conversion of HTML entities (accented letters, symbols, and so on) to ASCII codes lower than 128. This is required for text forwarded on networks like FidoNet, where the allowed character codes can only range between 32 and 127.

IMPORTANT: the ANSI option adds Escape codes (ASCII 27), which are forbidden on FidoNet and strongly discouraged for any non-personal use (broadcast) of the converted text.

Default: OFF (8-bit chars are not converted).
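
To give an idea of what a 7-bit conversion involves, here is a minimal Python sketch. It is NOT the conversion table used by HTTX, which is not described in this manual: the function name, the use of Unicode decomposition to strip accents, and the "?" fallback are all assumptions made for the example.

```python
import unicodedata


def to_7bit(text, fallback="?"):
    """Replace every character above 127 with a plain ASCII stand-in."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        # Decompose accented letters (e.g. e-grave becomes "e" plus a
        # combining accent) and keep only the ASCII part.
        decomposed = unicodedata.normalize("NFKD", ch)
        ascii_part = "".join(c for c in decomposed if ord(c) < 128)
        # ASSUMPTION: characters with no ASCII part become the fallback.
        out.append(ascii_part or fallback)
    return "".join(out)


print(to_7bit("perch\u00e9"))
```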

HRMODE or HR
HTML documents often contain the <HR> TAG, which defines a separating line between paragraphs. HTTX can handle these lines in several ways:

HRMODE=0
No lines drawn.
This was the NOHR option in HTTX versions before 1.5.

HRMODE=1
Lines are drawn using the minus "-" character.

HRMODE=2
Lines are drawn using underlined spaces (in ANSI). This mode generates a nicer line but introduces ANSI codes, so it must be avoided if the text will be posted on Fidonet or Usenet newsgroups.

Default: HRMODE=1 (lines are inserted using the minus "-" character).
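
The three modes can be illustrated with a short Python sketch. The function name is hypothetical, and two details are assumptions not stated in this manual: that the line spans the full LEN width, and that mode 2 ends with the reset sequence "\e[0m".

```python
ESC = "\x1b"  # the Escape character (ASCII 27)


def draw_hr(mode=1, line_length=77):
    """Return the text emitted for one <HR> in the given HRMODE."""
    if mode == 0:
        return ""                      # no line drawn
    if mode == 1:
        return "-" * line_length       # plain minus characters
    if mode == 2:
        # Underline on, a run of spaces, then attributes reset (assumed).
        return f"{ESC}[4m{' ' * line_length}{ESC}[0m"
    raise ValueError("HRMODE value must be 0, 1 or 2")


print(draw_hr(1, 20))
```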

NOALIGN or NA
HTTX supports right and center alignment of text and separators (<HR>).

Examples:

                                centered text
                                                           right-aligned text

If the NOALIGN option is ON, both of the above lines will start at the left margin, which saves characters.

Default: OFF (alignment is converted).
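
The saving mentioned above is easy to see in a small Python sketch that pads text with spaces, as a converter working without ANSI codes has to do. The function name is hypothetical and the padding rule is an assumption, not the exact algorithm used by HTTX.

```python
def align(text, how="left", line_length=77, noalign=False):
    """Pad text with leading spaces to center or right-align it."""
    if noalign or how == "left":
        return text                    # NOALIGN: everything starts at the left
    if how == "center":
        return " " * ((line_length - len(text)) // 2) + text
    if how == "right":
        return text.rjust(line_length)
    raise ValueError("how must be 'left', 'center' or 'right'")


aligned = align("right-aligned text", "right")
plain = align("right-aligned text", "right", noalign=True)
print(len(aligned) - len(plain), "characters saved")
```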

FILENOTE or FN
Saves the document title (<TITLE>) as the output file comment. Only the first 64 characters are saved, as suggested by the HTML standard.

Default: OFF (the title is not saved as the comment, but it can still appear inside the output file).

SITE
Adds the specified source URL to the output file.
It may be useful to know which document the file was created from.

Example:
HTTX ram:children.html SITE=http://www.unicef.org

will start the file with "URL  : http://www.unicef.org"

Note: SITE has priority over GETNOTE, so specifying a site this way will override that option if it is active.

Default: OFF (without this option the URL will not be added).

GETNOTE or GN
Uses the InputFile comment as the URL. This option is an alternative to SITE and is useful with files saved by the AWeb browser, or by other browsers that store the URL in the file comment. The AmigaDOS comment length limit is 80 chars.

Default: OFF.

NOHEADER or NOHEAD
Don't insert HTTX version information, title (<TITLE>) and URL in the converted file. This option automatically turns off the SITE and GETNOTE options if present.

Default: OFF (HTTX version, optional title and URL may be added).

HREF or LINK
Adds the addresses of links (<A HREF> TAG) to the output file.
Very useful if the document contains links that you want to keep.

Default: OFF (links aren't added).

IMG
Adds the ALT-text of images (<IMG> TAG) to the output file.
Useful if the document contains images with descriptions.

Default: OFF (ALT-text isn't added).

SCRIPT
Adds the content of the <SCRIPT> element, e.g. JavaScript, to the output file.
Note: this option adds the script itself, not its result!

Please read the "notes about conversion of <PRE>, <XMP>, <LISTING> and <SCRIPT> contents" section for important information about conversion of this type of text.

Default: OFF (<SCRIPT> content is skipped).

BADHTML or BAD
Partial support for documents that do not follow the HTML standards. Use this option only if parts of the converted page are missing. Using this option with correct HTML documents may cause unpredictable results in the converted document.

Default: OFF (HTTX uses standard DTD rules).

FORCE
Forces conversion of input file without checking if it is an HTML document.

USE IT AT YOUR OWN RISK: conversion of text or binary files may cause unpredictable results.

Normally, HTTX considers a file valid HTML when:

This option must be specified if the three above conditions are false, even if the file IS an HTML document.

Default: OFF (automatic check of file).

STDIO
Shows the converted file on the screen instead of saving it to disk. This option automatically enables the QUIET option.

Default: OFF (converted file is saved to disk).

PRINT
Prints the document instead of displaying or saving it.

The printer.device will convert standard ANSI codes and end-of-lines to the ones used by the Printer set in your Preferences.

This option should be used if you want to print the converted document, especially if the ANSI option is enabled, because ANSI codes used for conversion are different from the printer ones, which are more generic.

Older versions of HTTX used a solution like "HTTX aboxe.html prt:", which should now be avoided.

This option automatically enables QUIET, and turns off FILENOTE and STDIO options.

Default: OFF (document is displayed on screen or saved to a file).

APPEND
Normally HTTX will overwrite an existing file.

If APPEND is ON, the converted text will be added to the end of the specified file.

Default: OFF (overwrite output file if it already exists).

NOCFG
HTTX loads a default configuration from ENV:httx.prefs (unless another file is specified with the CFG option). If NOCFG is ON, HTTX skips the configuration file and uses only its default values and the parameters specified on the command line.

For more information see the "External configuration" section.

Default: OFF (HTTX searches for its configuration).

CFG
With this option it is possible to specify the name of the configuration file to be used by HTTX. This file MUST be located in the ENV: directory.

This option turns NOCFG OFF.
For more information see the "External configuration" section.

Default: OFF (HTTX loads the httx.prefs configuration file).

INCLUDE
With this option it is possible to include a text file at the start of each output file, before the converted HTML data.

The text file is NOT ALTERED IN ANY WAY: no 8-bit character conversion, wordwrap, ANSI codes and so on are applied. HTTX will not warn if it contains 8-bit characters.

Remember this, especially if the converted text will be posted in message areas, like Fidonet or Usenet newsgroups, where 8-bit chars are not allowed.

Default: OFF (no text file is included in the output file).

QUIET
Don't display any HTTX messages. This option is useful when HTTX is used within a script.

WARNING: if active, this option also hides error messages, but the AmigaDOS error codes are always returned.

Default: OFF (HTTX output is displayed).

If an option is not specified, HTTX uses its default setting.

When conversion is finished, if QUIET option is OFF, HTTX will show:



External configuration

HTTX supports an external configuration: a text file that holds your most used options, so that they do not need to be typed every time you use HTTX.

By default (unless the NOCFG option is set, or the CFG option names a different file) HTTX looks for the file "ENV:httx.prefs". It is possible to keep multiple configurations, for instance one for file conversion and another for printing, by creating different configuration files and passing the name of the file with the CFG option (do not specify the path, it is always "ENV:"). Examples:

HTTX aboxe.html

Converts "aboxe.html" using the default configuration (ENV:httx.prefs).

HTTX aboxe.html PRINT CFG=httxprt.prefs

Converts "aboxe.html" using the configuration file ENV:httxprt.prefs

Allowed parameters

The external configuration file supports a subset of available command line options. Each option MUST be specified in its extended form (for example ANSI, not AN, INDENT instead of IN, and so on).

The file must contain ONLY the options and their parameters, if any. Each option may be placed on a separate line for better readability.

Available options (for description see "Command line parameters" section):

LEN
the maximum length for output lines.
INDENT
the indentation size.
ANSI
allow the use of ANSI codes in conversion.
7BIT
convert HTML entities with ASCII code >127 to 7 bit chars.
HRMODE
line drawing mode.
NOALIGN
ignore center and right alignment.
FILENOTE
save HTML document title as the file comment.
GETNOTE
use the source file comment as URL in destination file.
NOHEADER
don't output the HTTX version, URL and title of original HTML document.
HREF
add links (<A HREF>) to destination file.
IMG
add ALT-Text of images to destination file.
SCRIPT
add the contents of <SCRIPT> elements to the destination file.
BADHTML
partial support for badly written HTML.

Parameters specified on the command line are applied after the parameters specified in the configuration file. This can override (or toggle a second time) one or more options.

Examples:

If a configuration file has the following line:

IMG GETNOTE LEN=70

and on command line you write:

HTTX aboxe.html IMG

the result is that IMG is first turned on, because it is present in the configuration file, and then turned off again, because it is also present on the command line.

HTTX aboxe.html LEN=74

LEN is present both in the configuration file and on the command line; the command line value overrides the earlier one, so LEN is now set to 74.
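
The two examples above can be reproduced with a small Python sketch of the merging rule: switches toggle each time they appear, while KEY=value options simply overwrite the earlier value. The function names and the dictionary layout are hypothetical, and the sketch is not taken from the HTTX source.

```python
# Options that act as ON/OFF switches in the configuration file.
SWITCHES = {"ANSI", "7BIT", "NOALIGN", "FILENOTE", "GETNOTE",
            "NOHEADER", "HREF", "IMG", "SCRIPT", "BADHTML"}


def apply_options(settings, words):
    """Apply a list of option words to settings, in order."""
    for word in words:
        name, sep, value = word.partition("=")
        name = name.upper()
        if sep:
            settings[name] = int(value)          # LEN, INDENT, HRMODE
        elif name in SWITCHES:
            # A switch seen again flips back to its previous state.
            settings[name] = not settings.get(name, False)
    return settings


def effective_settings(config_line, command_line_options):
    """Configuration file first, then the command line on top of it."""
    settings = {"LEN": 77, "INDENT": 3, "HRMODE": 1}   # documented defaults
    apply_options(settings, config_line.split())
    apply_options(settings, command_line_options)
    return settings


print(effective_settings("IMG GETNOTE LEN=70", ["IMG"]))
print(effective_settings("IMG GETNOTE LEN=70", ["LEN=74"]))
```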

How to create an external configuration

External configuration files are in effect system variables, located in the ENVARC: directory (on disk) and in ENV: (generally in RAM). The contents of ENV: are valid only for the current session, while the contents of ENVARC: remain valid after a reset.

To permanently save a configuration file, copy it to both ENV: and ENVARC:.

Use your favorite text editor (Ed, Cygnus Editor, GoldEd, and so on) to create your prefs file; httx.prefs is the default filename. Save it in ENV:, and also in ENVARC: so that it is not lost when you reboot. Temporary changes can be made by editing just the ENV: file.

HTTX configuration may be fully managed using the plugin for the AWeb WWW browser.



Error Messages and AmigaDOS Return Codes

When execution terminates, HTTX returns the appropriate AmigaDOS Return Code (RC), usable within scripts to determine if the conversion was successful. See your AmigaDOS handbook for a complete list of error codes.

In case of error, if QUIET is off, the appropriate AmigaDOS message will be displayed.
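
When the return code is collected by a script or a calling program, it is usually compared against the conventional AmigaDOS levels (0 = OK, 5 = WARN, 10 = ERROR, 20 = FAIL). The Python sketch below only illustrates that comparison; the function name is hypothetical, and which level HTTX returns for each error is not stated in this manual.

```python
# Conventional AmigaDOS return code levels.
RETURN_OK, RETURN_WARN, RETURN_ERROR, RETURN_FAIL = 0, 5, 10, 20


def severity(rc):
    """Classify a return code the way a script with thresholds would."""
    if rc >= RETURN_FAIL:
        return "fail"
    if rc >= RETURN_ERROR:
        return "error"
    if rc >= RETURN_WARN:
        return "warn"
    return "ok"


print(severity(0), severity(10))
```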

Following is a list of the most common errors. If the system is localized, messages are displayed in the appropriate language. See your AmigaDOS manual for further information.

Argument line invalid or too long
The arguments were entered incorrectly.

*** Break
The user has pressed Control-C keys, interrupting the conversion, and the output file has been removed.

Not enough memory available
There is no memory available to allocate the buffers used by HTTX. This can happen if the HTML file to convert or the text file to include is too big or the memory is too fragmented.
Try rebooting your Amiga.

Object not found
Specified file doesn't exist or it's not accessible.
Check file name and path.

Object is not of the required type
The input file does not seem to be an HTML document.
Try using the FORCE option.

HTTX can show other errors (in English only) due to wrong use of commands or options:

The line length must be between 15 and 255 characters (current is NN)
The line size specified with LEN parameter is a number less than 15 or more than 255.

Indentation size must be at least 1 (current is NN)
The indentation size must be at least one character.

With line length XX, indentation size YY, max indent level is ZZ.
You must allow at least 3 indentations.
The maximum indentation level, with specified line size and indent value, is less than 3.

HRMODE value must be 0, 1 or 2
Value set for HRMODE is not valid. It must be 0, 1 or 2.

Finally, there are a few warnings which may be displayed. The conversion will take place but there may be situations altering the final result:

Error in env config. HTTX will use its defaults
External configuration file has errors. HTTX will use the default settings and the parameters passed on the command line.
Check external configuration file.

ENV config 'NAME' not found.
The configuration file specified with CFG option was not found.
HTTX will use the default settings and the parameters passed on command line.
Remember that the configuration file must be located in ENV:. Remember also to copy it again to ENVARC: when you edit it.

Found non-ASCII chars in preformatted text!
HTTX has found some 8-bit characters in a preformatted text section.
Do not ignore this warning if the converted text will be posted in Fidonet conferences or Usenet newsgroups.

This file contains non standard HTML comment(s)!
File could not be completely converted, since non standard HTML comments were found. If this is the case, try using BADHTML option.

Include file could not be added.
File specified with INCLUDE option can't be added because it doesn't exist or it's not accessible.
Check file name and path.



FAQ (advice, interfacing to other programs and more)

Q. "ANSI styles (bold, italic, underline, blue) stop after first line."
A. See the notes about ANSI conversion.

Q. "Converted text isn't centered, but in the original document it is."
A. This can happen if the text in a table row (<TR>) or cell (<TD>) is defined as centered. To maintain compatibility with some programs used with HTTX, this version does not yet support alignment defined in those elements. It will be added in a future version with better table support.

Q. "Sometimes alignment doesn't work, wordwrap and lists are not correctly formatted or HTML TAGS are shown".
A. This happens with text enclosed in <PRE> TAGS. HTTX copies that text as is, without formatting. This choice was made because such text often contains source code that the author probably wishes to keep as is.

In <LISTING> and <XMP> the TAGS are left as they are, as the specifications for those elements require. Although its use is deprecated in HTML 4.0, <XMP> is still widely used for examples in many documents, for instance in the Netscape JavaScript specifications.

Q. "Some pages are not correctly converted..."
A. There could be many reasons: a layout based on tables (not fully supported, see the "What is supported" section), errors in the HTML source (HTTX is quite tolerant, but there are limits) or errors in the HTTX engine. If you think the page is correct, send me an E-mail with the URL.
(E-Mail: favrin@tin.it, FidoNet: 2:333/726.8)

Q. "How can I directly use HTTX from AWeb or Directory Opus?"
A. Users of AWeb version 3.1 or later can use the enclosed ARexx plugin.

HTTX can be used from Directory Opus by creating a button configured as follows (Directory Opus 4.12):

New Entry/AmigaDOS:
C:HTTX {f} {d}

(replace C: with path for HTTX)

With this configuration, a file selected from "source" directory will be converted to text and saved to the "destination" directory.

By activating "Do all files" flag it's possible to convert more than one file, by selecting them and clicking the HTTX button.

Q. "How can I improve the performance of HTTX?"
A. To speed up the conversion, try using a filesystem with blocks of 1024 bytes, such as the RAM disk. Note that if memory is almost full or fragmented, saving to the RAM disk may slow down the conversion process.



Technical information

This section covers some aspects of HTML and their implementation in HTTX. Although reading it is not required to learn how to use HTTX, it contains important information about conversion that you should read if you plan to distribute your converted texts.



What is supported, what is not (yet) and implementation

Supported HTML:

What is (not yet fully) supported:

Implementation of the standard:

These rules are followed except in <SCRIPT>, <PRE>, <LISTING> and <XMP> elements. See "Notes about conversion of <PRE>, <XMP>, <LISTING> and <SCRIPT> contents" for more information about this.



Notes about ANSI conversion

If the ANSI option is enabled, HTTX uses standard ANSI escape sequences for converting HTML styles (such as bold, italic, underlined and so on), links (rendered as underlined blue), centering and indentation of text.

The ANSI codes used are taken from standard ANSI specifications and should be supported by any program (maybe...).
These are the sequences (ESC is replaced by "\e"):

Bold        \e[1m
Italic      \e[3m
Underlined  \e[4m
Blue        \e[33m

Multiple attributes are combined using ";": for example, to set bold and italic HTTX uses "\e[1;3m".

For list indentation and text alignment, HTTX uses the cursor position sequence "\e[nnC" where nn is the number of characters to move right. This sequence is not used when printing.
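
As an illustration of how the sequences listed above are put together, here is a short Python sketch. The function names are hypothetical; only the numeric codes and the "\e[nnC" form come from this section.

```python
ESC = "\x1b"  # written as "\e" in this manual

# Codes listed in this section (33 is the value this manual gives for blue).
CODES = {"bold": 1, "italic": 3, "underlined": 4, "blue": 33}


def style(*names):
    """Build one sequence that sets several attributes, joined by ';'."""
    return f"{ESC}[{';'.join(str(CODES[n]) for n in names)}m"


def move_right(columns):
    """Cursor sequence used for indentation and alignment."""
    return f"{ESC}[{columns}C"


# repr() shows the escape character instead of sending it to the console.
print(repr(style("bold", "italic")))
print(repr(move_right(12)))
```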

Compatibility problems:



Notes about conversion of <PRE>, <XMP>, <LISTING> and <SCRIPT> contents

The rules specified for implementation of the standard, wordwrap of text and 7-bit conversion of 8-bit characters do not fully apply to some elements. The HTML specifications require them to be treated differently.

TAG <PRE> (preformatted text)
In this mode wordwrap is not done. Parsing of HTML TAGS (except lists and alignment) and entities is done. Numeric entities lower than 32 are left unchanged to avoid problems with uuencoded files contained in HTML pages.

TAGS <XMP>, <LISTING> and <SCRIPT>.
The contents of these tags are left unchanged. No wordwrap, entity conversion, or TAG parsing is done at all. If the 7BIT option is used, ASCII characters lower than 32 are converted to spaces. ASCII characters higher than 128 are left as they are and Win'95 entities are not remapped.

Remember this if the converted text will go on message areas, like Fidonet or Usenet newsgroups!



How to contact the Author

Besides any message about problems, bug reports, advice or other matters, a comment about HTTX will be appreciated, as will news of any donation made to organizations that take care of children (see "CHILDWARE").

E-Mail : favrin@tin.it
FidoNet: 2:333/726.8

Please write to me in Italian or English, thank you.

HTTX support page:
http://www.aspide.it/freeweb/poing/soft/httx/index.html



Greetings

Beta testing of version 1.7:

HTTX 1.5 English documentation and spell check of 1.7:

Beta testing of AWeb Plugin:

Many thanks to Yvon Rozijn for wanting HTTX inside AWeb II and for all the help he gives me!

Thanks to W. v. Oortmerssen for the splendid AmigaE, used to write HTTX.

Finally, special greetings to those who wrote to me about HTTX and to those who use it!



Program history

V1.0 (July 1996)

V1.1 (November 1996)

V1.1a (January 1997)

V1.1b

V1.5 (May 1997)

V1.7 (January 1998)



Future versions

HTTX is a program in continuous growth: because I use it daily, I keep noticing shortcomings and possible improvements. This version will be the basis for a new, even better version that will be out... soon.



